An introduction to the package {targets}
Use it to coordinate your data analysis projects
🧑‍💻
🎯
“{targets} implicitly nudges users toward a clean, function-oriented programming style that fits the intent of the R language”
code = results 🔬
If you want to follow along in your local R environment, you can clone the repository at this link:
Tip
Quick plug: you can use my package to create a pre-populated R project directory!
More information here: https://github.com/JT-39/dau-R-template-ext
The project directory could look something like…
source(here::here("src/R/functions.R"))

# Path to absence data
absence_data_file_path <- here::here("_data/raw/1_absence_3term_nat_reg_la.csv")

# Extract national absence and format date
df_nat_absence <- get_nat_absence_data(absence_data_file_path) |>
  format_time_period()

# Fit a linear model
model <- fit_model(df_nat_absence)

# Plot the data and model
plot_model(model, df_nat_absence)

# Pull national absence from file
get_nat_absence_data <- function(file_path) {
  read.csv(file = file_path) |>
    dplyr::filter(geographic_level == "National")
}

# Extract the start year from academic year
extract_year <- function(date) {
  substr(date, 1, 4)
}

# Add a numeric Date column holding the start year of each academic year
format_time_period <- function(data) {
  data |>
    dplyr::mutate(
      Date = lubridate::year(as.Date(extract_year(time_period), format = "%Y")),
      .after = time_period
    )
}
# Fit the model and pull coefficients
fit_model <- function(data) {
  lm(sess_overall_percent_pa_10_exact ~ Date, data) |>
    coefficients()
}

# Round up to a multiple of five, leaving at least one unit of headroom
round_to_multiple_five <- function(x) {
  ceiling((x + 1) / 5) * 5
}
# Plot the data and model
plot_model <- function(model, data) {
  ggplot2::ggplot(data) +
    ggplot2::geom_point(ggplot2::aes(x = Date,
                                     y = sess_overall_percent_pa_10_exact,
                                     colour = school_type)) +
    ggplot2::geom_line(ggplot2::aes(x = Date,
                                    y = sess_overall_percent_pa_10_exact,
                                    colour = school_type)) +
    ggplot2::scale_colour_manual(values = kasstylesr::color_picker(4),
                                 breaks = c("Total", "State-funded primary",
                                            "State-funded secondary", "Special")) +
    ggplot2::geom_abline(intercept = model[1],
                         slope = model[2],
                         show.legend = TRUE,
                         colour = "red",
                         linetype = "dashed") +
    # Label the line at its highest fitted value; refit on the `data` argument
    # rather than relying on a global data frame
    ggplot2::annotate("text",
                      x = max(data$Date),
                      y = lm(sess_overall_percent_pa_10_exact ~ Date,
                             data) |>
                        fitted.values() |>
                        max(),
                      hjust = -0.45,
                      label = "Line of best fit",
                      colour = "red") +
    ggplot2::scale_y_continuous(breaks = scales::pretty_breaks(),
                                limits = function(x) {
                                  c(0, round_to_multiple_five(max(x)))
                                }) +
    ggplot2::coord_cartesian(clip = "off") +
    ggplot2::theme_minimal() +
    kasstylesr::kas_style() +
    ggplot2::labs(
      title = "Average persistent absence over time in England",
      subtitle = "Split by school type. Only includes persistent absentees.",
      x = "",
      y = "Overall absence rate (%)",
      colour = "School type:"
    )
}

tar_target()
name
command ~ the function that generates the target
> dispatched target data_file
o completed target data_file [0.5 seconds]
> dispatched target nat_data
o completed target nat_data [0.62 seconds]
> dispatched target nat_data_clean
o completed target nat_data_clean [0.02 seconds]
> dispatched target model
o completed target model [0 seconds]
Saving 7 x 7 in image
> dispatched target plot
o completed target plot [0.76 seconds]
> ended pipeline [2.21 seconds]
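The log above comes from a pipeline defined in a `_targets.R` file at the project root. The file itself is not reproduced here, so what follows is only a sketch of how it might look, assuming the five target names from the log (`data_file`, `nat_data`, `nat_data_clean`, `model`, `plot`) and the functions from `src/R/functions.R`. The `Saving 7 x 7 in image` message suggests the real `plot` target also saves the figure to disk; that step is left out because its details are not shown.

```r
# _targets.R (illustrative sketch, not the original file)
library(targets)

# Load the analysis functions shown earlier
source("src/R/functions.R")

# The pipeline is a list of targets; each one is a name plus the command
# that generates it. Dependencies are inferred from the names used in each
# command.
list(
  # Track the raw data file itself, so edits to it invalidate the pipeline
  tar_target(
    data_file,
    "_data/raw/1_absence_3term_nat_reg_la.csv",
    format = "file"
  ),
  tar_target(nat_data, get_nat_absence_data(data_file)),
  tar_target(nat_data_clean, format_time_period(nat_data)),
  tar_target(model, fit_model(nat_data_clean)),
  tar_target(plot, plot_model(model, nat_data_clean))
)
```

Calling `targets::tar_make()` from the project root then runs the targets in dependency order, and `targets::tar_read(plot)` returns a stored result.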
{targets} creates a pipeline of pure functions
Target outputs are stored in _targets/objects/
By default they are saved in .rds format
Add _targets/ to .gitignore (for GitHub) ☝️
Advanced
Unless the data file is itself a target (format = "file"), {targets} will not track the data (any changes to it)
{targets} keeps track of changes in files and functions 🕵️
graph LR
style Legend fill:#FFFFFF00,stroke:#000000
style Graph fill:#FFFFFF00,stroke:#000000;
subgraph Legend
direction LR
xf1522833a4d242c5([Up to date]):::uptodate --- xd03d7c7dd2ddda2b([Stem]):::none
xd03d7c7dd2ddda2b([Stem]):::none --- xeb2d7cac8a1ce544>Function]:::none
end
subgraph Graph
direction LR
xb1fbb690b4ec8c10>extract_year]:::uptodate --> xd20a83ce47f3194c>format_time_period]:::uptodate
xf7d598eca7911241>round_to_multiple_five]:::uptodate --> xec203b5a68d60f72>plot_model]:::uptodate
xd20a83ce47f3194c>format_time_period]:::uptodate --> xc2980a3d74445b80([nat_data_clean]):::uptodate
x83c942fcaf37c3dc([nat_data]):::uptodate --> xc2980a3d74445b80([nat_data_clean]):::uptodate
x0d01c84c9424364d([data_file]):::uptodate --> x83c942fcaf37c3dc([nat_data]):::uptodate
x9242f8c59a209716>get_nat_absence_data]:::uptodate --> x83c942fcaf37c3dc([nat_data]):::uptodate
x9043e9d6bef6a839([model]):::uptodate --> x667cd56a75e2bb2b([plot]):::uptodate
xc2980a3d74445b80([nat_data_clean]):::uptodate --> x667cd56a75e2bb2b([plot]):::uptodate
xec203b5a68d60f72>plot_model]:::uptodate --> x667cd56a75e2bb2b([plot]):::uptodate
x12e88730e39644dc>fit_model]:::uptodate --> x9043e9d6bef6a839([model]):::uptodate
xc2980a3d74445b80([nat_data_clean]):::uptodate --> x9043e9d6bef6a839([model]):::uptodate
end
classDef uptodate stroke:#000000,color:#ffffff,fill:#354823;
classDef none stroke:#000000,color:#000000,fill:#94a4ac;
linkStyle 0 stroke-width:0px;
linkStyle 1 stroke-width:0px;
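The file and function tracking mentioned above rests on comparing hashes: if the hash of an input is unchanged since the last run, anything built from it can be skipped. The toy example below illustrates that idea in base R with `tools::md5sum()`. It is not how {targets} is implemented internally, and the file contents are made up for the demonstration.

```r
# Toy illustration of hash-based change detection (base R only)
path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), path)

# Record the hash of the file as it stands now
hash_before <- unname(tools::md5sum(path))

# Hashing again without editing gives the same value,
# so work that depends on this file could be skipped
hash_same <- unname(tools::md5sum(path))
unchanged <- identical(hash_before, hash_same)

# Editing the file changes the hash, which would mark
# everything downstream of it as outdated
writeLines(c("a,b", "1,3"), path)
hash_after <- unname(tools::md5sum(path))
changed <- !identical(hash_before, hash_after)
```

In a real project, `targets::tar_outdated()` lists the targets that would rerun, so there is no need to compare hashes by hand.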
{targets} knows which parts of the pipeline can be run in parallel
Need to load a few more packages:
Utilise the {targets} function to run in parallel:
Simple as that! 💥
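The code from these two slides is not reproduced above. As a guess at what it might have looked like, the sketch below assumes the extra packages are {future} and {future.callr} and that the function is `tar_make_future()`. The worker count of 2 is arbitrary.

```r
# Additions near the top of _targets.R (assumed setup, not taken from the slides)
library(targets)
library(future)
library(future.callr)

# Run each worker in its own background R process
plan(callr)

# ... the list of tar_target() calls stays as it is ...

# Then, from the R console, instead of tar_make():
# targets::tar_make_future(workers = 2)
```

Only targets that do not depend on each other can run at the same time. Recent versions of the {targets} manual recommend a {crew} controller set through `tar_option_set(controller = ...)` over `tar_make_future()`, so check the manual for the current approach.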
🔮
Add .Rmd or .qmd to the pipeline
{targets} computation 🔋
{sqltargets} which applies {targets} principles to .sql files 📦
🎯
{targets} manual: Link
YouTube {targets} walkthrough: Link
Ofsted MI {targets} example pipeline GitHub: Link
These slides and mini {targets} example GitHub: Link
{sqltargets} GitHub: Link
Building reproducible analytical pipelines with R: Link
Email me at:
jake.tufts@education.gov.uk